On Entropy-Compressed Text Indexing in External Memory

Authors

  • Wing-Kai Hon
  • Rahul Shah
  • Sharma V. Thankachan
  • Jeffrey Scott Vitter
Abstract

A new trend in the field of pattern matching is to design indexing data structures that take space very close to that required by the indexed text (in entropy-compressed form) while also achieving good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, due to the intricate permutation structure of the BWT, no locality of reference can be guaranteed when we perform pattern matching with these indexes. Chien et al. [2008] gave an alternative text index that is based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure. Given a text T of length n drawn from an alphabet of size σ, they achieved an O(n log σ)-bit index for T and showed that this index can preserve locality in pattern matching and hence is suitable for external-memory settings. We improve upon this index and show how to apply entropy compression to reduce index space. Our index takes O(n(Hk + 1)) + o(n log σ) bits of space, where Hk is the kth-order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.
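To make the space bound concrete, the following is a minimal sketch of how the kth-order empirical entropy (Hk in the abstract) can be computed for a small text. It uses the standard textbook definition, in which each symbol is charged the zeroth-order entropy of the symbols sharing its preceding k-symbol context. The function name `empirical_entropy_k` is a hypothetical name chosen for this illustration and does not come from the paper, and the sketch says nothing about how the index itself is built.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy_k(text: str, k: int) -> float:
    """Return the kth-order empirical entropy of `text`, in bits per symbol.

    Hypothetical helper for illustration only (not from the paper).
    Uses the standard definition: group every symbol by the k symbols
    preceding it, sum the zeroth-order entropy cost of each group,
    and divide by the text length n.
    """
    n = len(text)
    if n == 0:
        return 0.0

    # followers[context] counts the symbols that appear right after `context`.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1

    total_bits = 0.0
    for counts in followers.values():
        group_size = sum(counts.values())
        for c in counts.values():
            # Each of the c occurrences costs log2(group_size / c) bits.
            total_bits -= c * log2(c / group_size)
    return total_bits / n


if __name__ == "__main__":
    sample = "abababab"
    for order in (0, 1):
        print(order, empirical_entropy_k(sample, order))
```

For the highly repetitive sample text, the zeroth-order entropy is 1 bit per symbol, while the first-order entropy drops to 0 because each symbol is fully determined by its predecessor. This is the kind of gap between n log σ and n·Hk that an entropy-compressed index aims to exploit.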

Similar resources

An Algorithmic Framework for Compression and Text Indexing

We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text T of n symbols drawn from an alphabet Σ, our bounds are stated in terms of the hth-order empirical entropy of the text, Hh. In particular, we provide a tight analysis of the Burrows-Wheeler transform (bwt) establishing a bou...


Compression, Indexing, and Retrieval for Massive String Data

The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compres...


Entropy-Compressed Indexes for Multidimensional Pattern Matching

In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...


Practical Dynamic Entropy-Compressed Bitvectors with Applications

Succinct/compressed data structures aim at providing the same functionality offered by classical data structures while using asymptotically less space. There exist several of these structures for a wide spectrum of applications ranging from strings over arbitrary alphabets to full-text indexing. Their theoretical promises have been met in practice in the static scenario, while the practicality ...


Self-Indexing XML

Self-indexing is a technology that integrates text compression and text indexing, such that a text collection can be simultaneously compressed and indexed. The resulting representation, called a self-index of the text, takes space close to that of the compressed text, is able to reproduce any text substring, and offers indexed searching of the collection. This has been a major breakthrough in t...



Publication date: 2009